control-flow graph
Machine learning-based malware detection for IoT devices using control-flow data
Embedded devices are specialised devices designed for one or only a few purposes. They are often part of a larger system, connected to it by wire or wirelessly. Embedded devices that are connected to other computers or embedded systems through the Internet are called Internet of Things (IoT) devices. Because they are widely used and insufficiently protected, these devices are increasingly becoming the target of malware attacks. Manufacturers often cut corners to save costs, or misconfigure the devices during production. The result can be missing software updates, ports left open, or security defects by design. Although these devices may not be as powerful as a regular computer, their large number makes them suitable candidates for botnets. Compromised IoT devices can even endanger health, since pacemakers are now connected to the Internet as well. This means that, without sufficient defence, targeted attacks against individual people are possible. The goal of this thesis project is to provide better security for these devices with the help of machine learning algorithms and reverse engineering tools. Specifically, I study the applicability of control-flow related data of executables for malware detection. I present a malware detection method with two phases. The first phase extracts control-flow related data using static binary analysis. The second phase classifies binary executables as either malicious or benign using a neural network model. I train the model using a dataset of malicious and benign ARM applications.
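As a rough illustration of the first phase, the sketch below turns a control-flow graph into a small numeric feature vector that a classifier could consume. The abstract does not say which features the thesis extracts, so the feature set, the `cfg_features` helper, and the example graph are all assumptions made for illustration.

```python
from collections import deque


def cfg_features(cfg, entry):
    """Compute simple structural features of a CFG.

    `cfg` maps each basic-block name to the list of its successor blocks.
    The chosen features are illustrative, not the thesis's actual ones.
    """
    nodes = set(cfg)
    for successors in cfg.values():
        nodes.update(successors)
    num_nodes = len(nodes)
    num_edges = sum(len(successors) for successors in cfg.values())

    # Breadth-first search to count blocks reachable from the entry block.
    seen = {entry}
    queue = deque([entry])
    while queue:
        block = queue.popleft()
        for successor in cfg.get(block, []):
            if successor not in seen:
                seen.add(successor)
                queue.append(successor)

    return {
        "nodes": num_nodes,
        "edges": num_edges,
        # Cyclomatic complexity E - N + 2, assuming one connected component.
        "cyclomatic": num_edges - num_nodes + 2,
        "max_out_degree": max((len(s) for s in cfg.values()), default=0),
        "reachable": len(seen),
    }


# Hypothetical CFG: an if/else branch followed by a loop.
example_cfg = {
    "entry": ["then", "else"],
    "then": ["loop"],
    "else": ["loop"],
    "loop": ["loop", "exit"],
    "exit": [],
}
features = cfg_features(example_cfg, "entry")
```

In a full pipeline, a vector like this would be computed per executable from a disassembler's output and passed to the neural network in the second phase.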
MANDO: Multi-Level Heterogeneous Graph Embeddings for Fine-Grained Detection of Smart Contract Vulnerabilities
Nguyen, Hoang H., Nguyen, Nhat-Minh, Xie, Chunyao, Ahmadi, Zahra, Kudenko, Daniel, Doan, Thanh-Nam, Jiang, Lingxiao
Learning heterogeneous graphs consisting of different types of nodes and edges enhances the results of homogeneous graph techniques. An interesting example of such graphs is control-flow graphs representing possible software code execution flows. As such graphs represent more semantic information of code, developing techniques and tools for such graphs can be highly beneficial for detecting vulnerabilities in software for its reliability. However, existing heterogeneous graph techniques are still insufficient in handling complex graphs where the number of different types of nodes and edges is large and variable. This paper concentrates on the Ethereum smart contracts as a sample of software codes represented by heterogeneous contract graphs built upon both control-flow graphs and call graphs containing different types of nodes and links. We propose MANDO, a new heterogeneous graph representation to learn such heterogeneous contract graphs' structures. MANDO extracts customized metapaths, which compose relational connections between different types of nodes and their neighbors. Moreover, it develops a multi-metapath heterogeneous graph attention network to learn multi-level embeddings of different types of nodes and their metapaths in the heterogeneous contract graphs, which can capture the code semantics of smart contracts more accurately and facilitate both fine-grained line-level and coarse-grained contract-level vulnerability detection. Our extensive evaluation of large smart contract datasets shows that MANDO improves the vulnerability detection results of other techniques at the coarse-grained contract level. More importantly, it is the first learning-based approach capable of identifying vulnerabilities at the fine-grained line-level, and significantly improves the traditional code analysis-based vulnerability detection approaches by 11.35% to 70.81% in terms of F1-score.
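To make the notion of a metapath concrete, the sketch below enumerates the type sequences (node type, edge type, node type, ...) found along fixed-length walks in a tiny heterogeneous graph. It is not MANDO's implementation: the `metapaths` helper, the node and edge type names, and the example graph are hypothetical and only illustrate the idea of relational connections between typed nodes.

```python
def metapaths(node_types, edges, length=2):
    """Enumerate metapath schemas of exactly `length` hops.

    `node_types` maps node id -> node type.
    `edges` is a list of (source, edge_type, destination) triples.
    Returns a set of tuples alternating node types and edge types.
    """
    adjacency = {}
    for src, edge_type, dst in edges:
        adjacency.setdefault(src, []).append((edge_type, dst))

    def walk(node, hops):
        # Yield the type sequence of every walk of `hops` edges from `node`.
        if hops == 0:
            yield (node_types[node],)
            return
        for edge_type, nxt in adjacency.get(node, []):
            for rest in walk(nxt, hops - 1):
                yield (node_types[node], edge_type) + rest

    found = set()
    for node in node_types:
        found.update(walk(node, length))
    return found


# Hypothetical contract graph mixing call-graph and control-flow edges.
node_types = {
    "f1": "FUNCTION",
    "f2": "FUNCTION",
    "s1": "STATEMENT",
    "s2": "STATEMENT",
}
edges = [
    ("f1", "calls", "f2"),
    ("f2", "contains", "s1"),
    ("s1", "next", "s2"),
]
schemas = metapaths(node_types, edges, length=2)
```

Here `schemas` contains two two-hop metapaths, one starting at each function node; a learning model could then aggregate neighbours separately per schema.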
A Library for Representing Python Programs as Graphs for Machine Learning
Bieber, David, Shi, Kensen, Maniatis, Petros, Sutton, Charles, Hellendoorn, Vincent, Johnson, Daniel, Tarlow, Daniel
A standard class of approaches in applying machine learning to code is to construct a graph representation of a program, and then to perform the analysis of interest on that graph representation, learning from a large dataset of labeled example programs. Graph representations of programs used for machine learning include the abstract syntax tree (AST), control-flow graph (CFG), data-flow graphs, inter-procedural control-flow graph (ICFG), interval graph, and composite "program graphs" that encode information from multiple of the aforementioned graphs, possibly with additional program-derived data. The python_graphs library directly allows for the construction of some of these graph types (e.g., control-flow graphs and composite program graphs) from arbitrary Python programs, and it provides tools that aid in constructing the others. It has been used successfully in a variety of machine learning for code publications, and we make it available as free and open source software to allow for broader use. In Section 2 we present an overview of the use of graph representations of code in machine learning. In Section 3 we describe the capabilities (Section 3.1), possible extensions (Section 3.2), and limitations (Section 3.3) of python_graphs. Section 4 highlights the applications of python_graphs for machine learning research. Section 5 presents a case study applying python_graphs to 3.3 million programs from Project CodeNet [28].
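The simplest of the graph representations listed above, the abstract syntax tree, can be obtained with Python's standard `ast` module alone. The sketch below lists parent-to-child edges labelled by node type; it deliberately does not use python_graphs, and the `ast_edges` helper is a hypothetical name introduced here for illustration.

```python
import ast


def ast_edges(source):
    """Return the AST of `source` as a list of (parent_type, child_type) edges."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


# A two-statement program: an assignment followed by a conditional.
edges = ast_edges("x = 1\nif x:\n    x = 2\n")
```

An edge list like this can be fed to a graph learning model as is, or enriched with control-flow and data-flow edges to form a composite program graph.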